Assignment 1: About the project

Here’s the link to the repo on github: https://github.com/oj-lappi/IODS-project

My background and motivations

I’d like to learn about statistical tests and possibly how to add R to data pipelines.

I don’t think R will replace my current tools for analyzing and plotting data, but I do want to at least see how feasible it would be to add R as a tool for developing statistical analyses for experiments, and possibly writing some small programs for quick statistical analysis in postprocessing of experiments.

I’m familiar with git; it’s a daily driver for me. I tend to use gnuplot and python for quick plotting, and python, UNIX tools, or custom C++ programs for data analysis utilities.

Goals

I’m interested in creating automatic data pipelines that process data in object storage and upload plots to a dashboard. R+Github pages might be a good way to do that, although I would prefer an on-prem solution (HU’s gitlab maybe?) or a self-hosted website.

RStudio/Git integration

I have ssh keys set up on each of my computers, and a personal access token would be less flexible than that.

I can push and pull with this setup from RStudio.

Exercises

I checked the exercises and followed some of them along locally. I think I’m getting the hang of it :).

# Timestamp:

date()
## [1] "Tue Nov 28 01:27:01 2023"

Assignment 2: Linear regression on the Learning2014 dataset

1. Reading the data

#Dependencies
#install.packages(c("readr","lmtest", "dplyr"))
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# Timestamp
date()
## [1] "Tue Nov 28 01:27:01 2023"

2. Analysis

lrn2014 <- read_csv('data/learning2014.csv')
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.1 The shape of the data

## [1] "dimensions: 166" "dimensions: 7"

There are 166 rows with 7 columns each.
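The two strings above were presumably produced by pasting over `dim()`; a sketch, assuming `lrn2014` is the tibble read in above:

```r
# dim() returns c(rows, cols); paste() recycles the prefix over both elements,
# which is why the output is two "dimensions: ..." strings
paste("dimensions:", dim(lrn2014))
```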

The spec is:

## cols(
##   gender = col_character(),
##   age = col_double(),
##   attitude = col_double(),
##   deep = col_double(),
##   stra = col_double(),
##   surf = col_double(),
##   points = col_double()
## )

2.2 Descriptive statistics

Sex distribution

gender seems to be a categorical variable, so let’s see the number of rows per gender (sex):
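The count table below can be produced with a short dplyr chain; a sketch (the chain is not shown in the rendered output, and assumes the gender column from the spec above):

```r
lrn2014 %>%
  group_by(gender) %>%       # one group per gender value
  summarize(count = n())     # count rows in each group
```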

## # A tibble: 2 × 2
##   gender count
##   <chr>  <int>
## 1 F        110
## 2 M         56

110 F and 56 M: there is a skew towards female students in the dataset. Let’s plot that.

## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
fig. 1.1, sex distributions

Age distribution

Let’s plot the age distribution of the students.

lrn2014 %>%
  ggplot(aes(x = age)) +
  stat_count()
fig. 1.2, age distributions

min(lrn2014$age)
## [1] 17
max(lrn2014$age)
## [1] 55
median(lrn2014$age)
## [1] 22

The age range is from 17 to 55, and the median is 22. Visually inspecting the distribution, the mode of the distribution is early twenties, as you would expect, although there is a long tail.

Age-sex distribution

Let’s combine the two columns into a classic population pyramid, or age-sex pyramid.

It turns out a population pyramid is not an out-of-the-box plot we can easily produce; we have to manually calculate bins and do some magic.

lrn2014 %>%
  mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
  group_by(gender, age_bin) %>%                      # Group by bin and gender
  summarize(count = n()) %>%                         # Count rows per group
  mutate(count = 
      if_else(gender == 'M', -count, count)) %>%     # Turn one negative
  ggplot(aes(x=count, y = age_bin)) +
  geom_col(aes(fill=gender)) 
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
fig. 1.3a, population pyramid

There are very few male students under 20; I would speculate that this is due to Finnish army conscription. Otherwise the distribution seems roughly equal on the female and male sides.

We can of course bin by single years instead of 5-year groups, which gives higher resolution but more noise.

## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
fig. 1.3b, population pyramid #2, fine-grained bins

There is one peculiar dip in female student participation around ages 26-28, which rebounds after thirty. This might be a maternity effect, but that is highly speculative: there are very few samples in this dataset.

Exam scores

Let’s look at exam scores:

paste("median:", median(lrn2014$points), ", mean:",mean(lrn2014$points), ", standard deviation:", sd(lrn2014$points))
## [1] "median: 23 , mean: 22.7168674698795 , standard deviation: 5.89488364245529"

Let’s look at the full distribution using geom_density.

lrn2014 %>%
  ggplot(aes(x=points)) +
  geom_density()
fig. 1.4a, exam score distribution

There’s a curious valley in the density at around 12-14 points. Let’s look closer.

lrn2014 %>%
  ggplot(aes(x=points)) +
  geom_histogram(boundary=0,binwidth=1) +
  scale_x_continuous(breaks = seq(0, 40, by = 1))
fig. 1.4b, exam score histogram

So no students got 12, 13, or 14 points. There must be some barrier behind the jump from 11 to 15. Let’s look at our two demographic variables.

Exploring exam score distributions for different groups

Point means by sex, with age gradient
lrn2014 %>%
  group_by(gender) %>%
  ggplot(aes(x=points)) +
  geom_histogram(boundary=0,binwidth=2, aes(fill=factor(age))) +
  facet_wrap(~gender)
fig. 1.5a, exam scores by sex, with age gradient

I see no clear bias either way; perhaps a slight correlation between score and age within the mode (20-30 years), and then no correlation for higher ages. The female distribution seems the most well-behaved, although the gap above 11 points is much sharper here as well.

Point means by age group, with sex
lrn2014 %>%
  mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
  group_by(gender, age_bin) %>%
  ggplot(aes(x=points)) +
  geom_histogram(boundary=0,binwidth=2, aes(fill=gender)) +
  facet_wrap(~age_bin)
fig. 1.5b, exam scores by age group, with labeled sex

Point means by age group, box plot
lrn2014 %>%
  mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
  ggplot(aes(x=age_bin, y = points)) +
  stat_boxplot(geom = "errorbar", width = 0.25) +
      geom_boxplot()
fig. 1.6a, exam score distributions by age group, box plot sequence

lrn2014 %>%
  mutate(age_bin = cut(age, breaks=seq(0,60,5))) %>% # Bin by age
  ggplot(aes(x=age_bin, y = points)) +
  stat_boxplot(geom = "errorbar", width = 0.25) +
      geom_boxplot()+
  facet_wrap(~gender)
fig. 1.6b, exam score distributions by age-sex group, box plot sequence

I see no correlation between age and exam score in these plots.

2.3 Correlation matrix

Finally, let us look at all the survey question scores together with the already explored variables in a correlation matrix.

lrn2014 %>%
  select(gender, age, surf, stra, deep, attitude, points) %>%
  ggpairs(aes(fill=gender, color=gender),lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 1.7, correlation matrix of all variables

There seem to be negative correlations between:

  • surf and deep (but seemingly only strongly for male students)
  • attitude and deep (weak, but stronger for male students again)
  • stra and surf (weak)

And a positive correlation between points and attitude! This is the strongest linear relationship in the data. We can also verify that age does not seem to have an effect at all.

There also seems to be a relationship between attitude and gender, but no relationship between gender and points.

3. Regression

Based on the data exploration, attitude seems the most likely candidate predictor, followed by stra and surf.

3.1 Simple regression as baseline

Let’s start with a simple regression model of points ~ attitude, which will be our baseline.

attitude_model <- lrn2014 %>%
                  lm(points ~ attitude, data = .)
summary(attitude_model)
## 
## Call:
## lm(formula = points ~ attitude, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

There is clearly a statistically significant relationship between attitude and points. But the adjusted R-squared is only around 18.6%, so there is a lot of variance not explained by the model.

3.2 Multiple regression, 3 variables

three_var_model <- lrn2014 %>%
                  lm(points ~ attitude + stra + surf, data = .)
summary(three_var_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The adjusted R-squared is higher, which suggests the added variables capture some additional variance even after the penalty for model complexity, although the value is still below 20%.
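As a sanity check, the adjusted value follows from the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1); plugging in the numbers printed above:

```r
r2 <- 0.2074  # multiple R-squared from the summary above
n  <- 166     # observations
p  <- 3       # predictors: attitude, stra, surf
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.1927, matching the printed adjusted R-squared
```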

It seems that the relationship between points and surf is not statistically significant.

Let’s drop surf and keep stra, and try again.

3.3 Multiple regression, 2 variables

two_var_model <- lrn2014 %>%
                  lm(points ~ attitude + stra, data = .)
summary(two_var_model)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Right, this is our best fit yet based on the adjusted R-squared. Depending on our choice of significance level, the stra term might still be discarded: at a standard level of α = 0.05 (with p = 0.089), we cannot reject the null hypothesis that stra has no linear effect on points.
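Spelled out as code, using the p-value printed in the summary above:

```r
alpha  <- 0.05
p_stra <- 0.08927  # Pr(>|t|) for stra, from the summary above
p_stra < alpha     # FALSE: we fail to reject H0 (the stra coefficient is zero)
```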

Let’s test another model, with nothing but stra.

3.4 Simple regression with only stra

stra_model <- lrn2014 %>%
                  lm(points ~ stra, data = .)
summary(stra_model)
## 
## Call:
## lm(formula = points ~ stra, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.5581  -3.8198   0.1042   4.3024  10.1394 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   19.233      1.897  10.141   <2e-16 ***
## stra           1.116      0.590   1.892   0.0603 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.849 on 164 degrees of freedom
## Multiple R-squared:  0.02135,    Adjusted R-squared:  0.01538 
## F-statistic: 3.578 on 1 and 164 DF,  p-value: 0.06031

We get very close to a statistically significant result at a standard significance level of 0.05, but not quite (p = 0.060). Let’s drop stra, since that’s what the assignment asked us to do.

The model we will be using is therefore the baseline model with one predictor: attitude.

4. Interpretation

4.1 Summary

Let’s rerun that summary:

summary(attitude_model)
## 
## Call:
## lm(formula = points ~ attitude, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The fitted regression coefficients are:

  • intercept: 11.6372
  • attitude: 3.5255

This means that the model predicts the conditional mean of the exam score, given an attitude value, as:

points = 11.6372 + 3.5255*attitude

If we multiply the attitude coefficient by the range of attitude values in the population, we can get an idea of how the model assigns expected exam scores based on a student’s attitude.

am <- mean(lrn2014$attitude)
as <- sd(lrn2014$attitude)
print("Range of predictor term within a sd:")
## [1] "Range of predictor term within a sd:"
3.5255*c(am-as, am, am+as)
## [1]  8.506549 11.079839 13.653130
print("Range of ŷ_i within a sd:")
## [1] "Range of ŷ_i within a sd:"
11.6372 + 3.5255*c(am-as, am, am+as)
## [1] 20.14375 22.71704 25.29033
print("Range of ŷ_i within two sd's:")
## [1] "Range of ŷ_i within two sd's:"
11.6372 + 3.5255*c(am-2*as, am, am+2*as)
## [1] 17.57046 22.71704 27.86362

So, assuming attitude is Gaussian and that the sample standard deviation is a good estimate, the model assigns an exam score:

  • between roughly 20.1 and 25.3 to a majority of students (about 68%; one standard deviation on each side of a Gaussian mean captures 34.1% of the population)
  • between roughly 17.6 and 27.9 to a super-majority (about 95%) of students

If we look back at the exam score distribution in figure 1.4, this does capture the mode of the distribution.

4.2 Multiple R-squared

The multiple R-squared value is 0.1906, the standard way to express this is that “19% of the variation in exam scores is explained by the attitude variable” (see MABS4IODS, 3.2.1).

I would interpret this to mean that, under the linear model, the expected squared error of a prediction given attitude is estimated to be roughly 80% of what it would be if we simply used the sample mean as the predictor. The estimate assumes the sample is representative, because we are using residuals to get this number.

I’m not quite sure if this more elaborate interpretation is exact, but it’s what I was able to piece together from sources online, mostly Wikipedia (https://en.wikipedia.org/wiki/Coefficient_of_determination).
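The roughly-80% figure itself is just arithmetic on the printed R-squared:

```r
r2 <- 0.1906  # multiple R-squared from the attitude model summary
1 - r2        # 0.8094: estimated residual variance as a share of total variance
```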

5. Regression diagnostics

par(mfrow=c(2,2))
plot(attitude_model)
fig. 2.0, regression diagnostics

par(mfrow=c(1,1))

5.1 Assumptions

The assumptions of the model are:

1. constant variance of the errors (homoscedasticity)
2. normality of the errors
3. a linear relationship between the predictor and the response

The Residuals vs. Leverage plot

The Residuals vs. Leverage plot checks that there aren’t any influential outliers affecting the fit of the regression coefficients. The plot has a dashed line showing a critical Cook’s distance value; in our case this dashed line is not even visible. Essentially, if a point is very far to the right and has a high standardized residual (far off the central line), it is a highly influential point and has to be looked into.

A highly influential point may be an indication of a point that should be excluded, but this has to be done on a case-by-case basis. It might also mean that assumption 3 is violated, that there isn’t a linear relationship between predictors and the response.

The Residuals vs. Fitted plot

The Residuals vs. Fitted plot (together with the Scale-Location plot) gives us a visual way to check for heteroscedasticity (a violation of assumption 1). If the red mean line is not horizontal, the residuals have a bias in some region of the fitted values (the variance is not constant).

I don’t think this means the data is heteroscedastic; it certainly doesn’t look like it is. But I’m not so familiar with these visual checks, so I searched the web for a homoscedasticity test and found the Breusch-Pagan test in the lmtest library.

attitude_model_bptest <- bptest(attitude_model)
attitude_model_bptest
## 
##  studentized Breusch-Pagan test
## 
## data:  attitude_model
## BP = 0.0039163, df = 1, p-value = 0.9501

A p-value of 0.95 means there is very little evidence of heteroscedasticity. Ok, good! Any trend in the plot would have to be much clearer than this to matter.

Normal QQ-plot

This plot compares the sample quantiles to the theoretical (“ideal”) quantiles. It is used to check for normality by comparing the residual distribution against a theoretical normal distribution.

There are some outliers at the bottom left, which may indicate a bimodal distribution (remember that the exam scores looked bimodal as well, fig 1.4). The plot also curves down a little at the upper end, which I believe usually indicates a light left skew.

Let’s test the residuals for normality, using the Shapiro-Wilk test:

shapiro.test(attitude_model[["residuals"]])
## 
##  Shapiro-Wilk normality test
## 
## data:  attitude_model[["residuals"]]
## W = 0.97394, p-value = 0.003195

The p-value is quite small, 0.003, so we reject the null hypothesis that the residuals are normally distributed.

Hmm, maybe the issue is the grading scale? I’ve tried to fix this, and I can get the QQ-plot to look nicer with some transformations, but they make the model test statistics worse. This is the best I can do for now.


Assignment 3: Logistic regression

1. Get the data

library(dplyr)
library(readr)
data_url <- "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"
df <- read_csv(data_url)
## `curl` package not installed, falling back to using `url()`
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Summary of variables

The variables in the data set are:

colnames(df)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The dataset contains student school performance data averaged over two courses, Portuguese and mathematics (the averaged variables are G1, G2, G3, absences, and failures; to be exact, the “paid” column comes from only one of the courses). The other variables are demographic and social; the social variables were collected using surveys.

Two variables, alc_use and high_use, are transformations of the original dataset added for this assignment. alc_use is the average daily alcohol use, combining self-reported weekday and weekend usage on a scale of 1 to 5. high_use is a boolean indicating whether alc_use is greater than 2.
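The transformation described can be sketched as follows (a toy example; Dalc and Walc are the dataset’s weekday and weekend consumption columns, both on a 1-5 scale):

```r
library(dplyr)

# Three toy students standing in for real rows
toy <- tibble(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))
toy <- toy %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # average of weekday and weekend use
         high_use = alc_use > 2)        # boolean cutoff used in this assignment
toy$alc_use   # 1.0 3.5 5.0
toy$high_use  # FALSE TRUE TRUE
```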

3. Choose 4 interesting variables in the data to study in relation to alcohol use

I choose:

  • Fedu: Father’s education level
  • Medu: Mother’s education level
  • absences: number of absences on average per course
  • G3: final course grade average

My hypotheses are:

  • parents’ education levels (Fedu, Medu) have a small but significant impact on the likelihood of the high_use variable.
  • an increase in absences increases the likelihood of high alcohol usage
  • a decrease in grades increases the likelihood of high alcohol usage

4. Explore variables

Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses. (0-5 points)
library(ggplot2)
library(GGally)
df %>% select(alc_use, high_use, Fedu, Medu, absences, G3) %>%
  ggpairs(aes(fill=high_use, color=high_use))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
fig. 1.1, correlation matrix of variables, colored by high_use value

  • It turns out that the dataset shows absolutely no correlation between parents’ education and alcohol usage.
  • Fedu is very highly correlated with Medu though, which is not surprising but interesting to note: people seem to marry within their social class.
  • absences has a very high correlation with alcohol use
  • grades have a slightly smaller negative correlation with alcohol use

Two out of four hypotheses seem to have some support in the data after this exploration.

5. Logistic regression

Let’s do a logistic regression model using the two statistically significant correlations: G3 and absences.

model <- glm(high_use ~ G3 + absences, data = df, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ G3 + absences, family = "binomial", 
##     data = df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.38732    0.43500  -0.890 0.373259    
## G3          -0.07606    0.03553  -2.141 0.032311 *  
## absences     0.08423    0.02309   3.648 0.000264 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 429.93  on 367  degrees of freedom
## AIC: 435.93
## 
## Number of Fisher Scoring iterations: 4

Both G3 and absences have p-values below 0.05 in the Wald tests of the fit, but the model’s confidence intervals give an easier-to-interpret picture.

5.1 Odds ratio

coeffs <- coefficients(model)
odds_ratio <- coeffs %>% exp
odds_ratio
## (Intercept)          G3    absences 
##   0.6788751   0.9267616   1.0878762

The odds ratios for G3 and absences are roughly 0.927 and 1.088. For G3, the OR is less than one because the correlation between the variables is negative.

The way to interpret these is:

  • for each one-point decrease in final grade average, the odds that a student has high alcohol use increase by roughly 8% (1/0.9268 ≈ 1.079)
  • for each additional absence per course, the odds that a student has high alcohol use increase by roughly 8.8%
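These percentages follow directly from the fitted coefficients; a quick check using the estimates printed in the glm summary above:

```r
b_G3  <- -0.07606  # G3 coefficient from the model summary
b_abs <-  0.08423  # absences coefficient
exp(b_abs)         # ~1.088: odds multiplier per additional absence
1 / exp(b_G3)      # ~1.079: odds multiplier per one-point decrease in G3
```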

But that’s just the point estimate of the effect; let’s look at confidence intervals on the odds ratios.

5.2 Confidence intervals

ci <- confint(model) %>% exp
## Waiting for profiling to be done...
ci
##                 2.5 %    97.5 %
## (Intercept) 0.2857593 1.5832545
## G3          0.8639686 0.9935498
## absences    1.0419618 1.1407755

The 95% confidence intervals both lie entirely on one side of unity, so we can say with 95% confidence that there is an effect for both variables, and the effect increases the odds by:

  • a factor in the range [1.006, 1.16] for each one-point decrease in final grade average (again inverting the odds ratio because the effect is negative)
  • a factor in the range [1.04, 1.14] for each additional absence per course
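The inverted interval for G3 is obtained by taking reciprocals of the CI endpoints and swapping them; checking against the printed values:

```r
ci_G3 <- c(0.8639686, 0.9935498)  # 95% CI for the G3 odds ratio, from above
rev(1 / ci_G3)                    # ~[1.006, 1.157]: odds factor per grade-point decrease
```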

5.3 Interpretation

There is an effect, although the final grade average effect gets damn near 1 at the low end of the confidence interval: there, the increase in odds per grade point is under 1%! For absences, the confidence interval is a little tighter, but it still seems a little wide to me for practical use. We would need more samples to get a tighter interval.

The two hypotheses that survived the initial exploration both match the outcomes of the logistic regression, and now we can quantify the quality of an estimate of the effect.

6. Predictions

Let’s test out the model with a confusion matrix.

probabilities <- predict(model, type = "response")
df <- df %>% 
        mutate(probability = probabilities) %>%
        mutate(prediction = probability > 0.5)

table(high_use = df$high_use, prediction = df$prediction) %>%
        prop.table %>%
        addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68108108 0.01891892 0.70000000
##    TRUE  0.25675676 0.04324324 0.30000000
##    Sum   0.93783784 0.06216216 1.00000000

The confusion matrix tells us that the model mislabels ~2% of all students as false positives and ~26% as false negatives. That’s pretty good! A high false negative share is not so bad here: those are just missed cases. A high false positive share would mean that the model is unreliable and cannot be used as an indicator (in this case; in general, the importance of different error types depends on the specific use case and on the meaning of negative and positive).

On the other hand, the confusion matrix also tells us that the model predicts 94% of all students to have non-high alcohol use, while in reality the number is 70%. So the model achieves this relatively safe indicator status by being rather conservative.
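The shares quoted here are sums over the margins of the proportion table; quick arithmetic on the printed values:

```r
tn <- 0.68108108; fp <- 0.01891892  # proportions from the confusion matrix above
fn <- 0.25675676; tp <- 0.04324324
tn + fn  # ~0.938: share predicted non-high (vs. 0.70 truly non-high)
fp + fn  # ~0.276: total misclassification rate
```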

Caveat

We haven’t done a split into a model fit dataset and a validation set, so this confusion matrix is of limited value.


Assignment 4: Classification and Clustering

1. Overview

This assignment is about classification and clustering. We’re looking at a dataset put together from census data, and seeing how the crime rate varies with multiple correlated variables.

The methods used are linear discriminant analysis (LDA) and k-means clustering.

2. The dataset

Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here (https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html). (0-1 points)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

Dataset structure

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

There are 506 rows with 14 columns. The dataset seems to have originally been put together to analyze housing values across the suburbs. From the paper cited on the stat.ethz.ch page, Harrison, D. and Rubinfeld, D.L. (1978), “Hedonic housing prices and the demand for clean air”, we find that a row represents an SMSA census tract: a census area in Boston containing some number of housing units.

The columns contain social statistics related to these census areas (e.g. crim = crime rate, ptratio = pupil-teacher ratio), data about the housing units in the area (rm = avg # of rooms per unit, medv = median housing unit value, age = prop. houses built before 1940), and data about the location of the area (e.g. dis = weighted mean of distances to employment centers, chas = 1 if by Charles River, 0 if not by Charles River).

Some of the columns are a little counter-intuitive or difficult to interpret. E.g., the column age is the proportion of houses built before 1940, and the column lstat is the proportion of the population that is “lower status”. In the Harrison & Rubinfeld paper, lower status means: “1/2 * (proportion of adults without some high school education and proportion of male workers classified as laborers)”.


Ok, before we move forward: we did see a small issue here, so let’s change chas from an integer to a boolean.

library(dplyr)
Boston_explore <- Boston %>% mutate(chas = chas == 1)
str(Boston_explore)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Summary

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
summary(Boston_explore)
##       crim                zn             indus          chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Mode :logical  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   FALSE:471      
##  Median : 0.25651   Median :  0.00   Median : 9.69   TRUE :35       
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14                  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10                  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74                  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Looking at this summary, all columns are definitely not created equal. I am mostly looking at the difference between the mean and median, which — if they differ by much — can indicate that a variable is not normally distributed. Some of the columns are close to normal distributions, but e.g. zn has a median of 0 and a mean of 11.36. Other highly skewed columns are rad, crim, tax, chas, and black.
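As a quick numeric version of this check, the mean-to-median gap can be computed per column (a sketch; assumes the MASS Boston data is loaded as above, and measures the gap in units of the standard deviation so columns are comparable):

```r
# Sketch: mean-to-median gap in standard deviations,
# a rough indicator of skew (roughly 0 for a symmetric distribution).
library(MASS)
gap <- sapply(Boston, function(col) (mean(col) - median(col)) / sd(col))
round(sort(abs(gap), decreasing = TRUE), 2)
```

The columns at the top of this sorted list should be the same skewed ones noted above.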

3. Graphical summary

With ggpairs

library(ggplot2)
library(GGally)
Boston_explore %>%
  ggpairs(lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 3.1, Correlation matrix

Ok, lots to unpack here. Let’s start with a visual check of each variable’s distribution (the diagonal). Almost none of the columns look normally distributed, with the possible exception of rm, the number of rooms.

There are lots of interesting correlations. Just looking at the scatter plots, the three variables rm, medv, and lstat seem to have strikingly strong relationships with each other, which makes sense to me.

The rad variable, which is an “index of accessibility to radial highways”, is clearly bimodal, or one could even say a split distribution. A subset of areas has a much higher index than the others, and in the scatter plots this clearly visible line of the higher-index population seems to consistently cover different ranges of the other variable than the lower-index population does. The effect is most clearly noticeable in the crim, tax, nox, dis and black scatter plots.

dis and nox also have a strikingly logarithmic-looking relationship.

In general, nearly every variable seems to be correlated with every other variable, excepting the chas (area is by the Charles river) column.

With corrplot

library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(Boston_explore), method="circle")
fig. 3.2, Correlation matrix with corrplot

Using corrplot, we lose some information, but get a better overview of where the correlations are strongest.

We see strong correlations (large balls) between:

  • dis and (zn, indus, nox, and age)
  • tax and rad (a very strong correlation; they seem to capture much of the same variation)
  • tax and (crim, indus, and nox), and ditto for rad
  • lstat and medv
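The same strongest pairs can be pulled out numerically rather than by eyeballing ball sizes (a sketch; the 0.7 cutoff is an arbitrary choice of mine):

```r
# Sketch: list variable pairs with |correlation| above an arbitrary threshold.
cm <- cor(Boston_explore)
cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each pair only once
idx <- which(abs(cm) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           r    = round(cm[idx], 2))
```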

4. Standardize and summarize

Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)

Let’s run the scale function on the dataset.

Boston_scaled <- as.data.frame(scale(Boston_explore))
summary(Boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We’ve now normalized the columns by subtracting the mean and dividing by the standard deviation such that, if they were normally distributed, they would now be standard normally distributed.
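To make the “subtract the mean, divide by the standard deviation” statement concrete, here is a sketch checking scale() against the manual computation on one column:

```r
# Sketch: scale() on one column equals manual standardization.
x <- Boston_explore$crim
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```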

Create a categorical crime rate column

# Create quartile class
Boston_scaled$crim <- as.numeric(Boston_scaled$crim)
Boston_scaled$crime <- cut(Boston_scaled$crim, breaks = quantile(Boston_scaled$crim), include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))

# Drop crim
Boston_scaled <- dplyr::select(Boston_scaled, -crim)

We’ve now converted the crime rate column into a categorical variable indicating which quartile of the crime rate distribution each census area falls in.

Create training and test datasets

set.seed(179693716) 
n <- nrow(Boston_scaled)
split <- sample(n,  size = n * 0.8)

train <- Boston_scaled[split,]
test <- Boston_scaled[-split,]

Now we’ve split the dataset into two: 80% of rows are in the training set, 20% are in the test set.
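A quick sanity check of the split sizes (80% of 506 rounds down to 404, leaving 102 for the test set, matching the nrow(test) output later):

```r
# Sketch: verify the 80/20 split row counts.
c(train = nrow(train), test = nrow(test))
```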

5. Linear discriminant analysis

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)
library(MASS) # lda() lives in MASS
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2475248 0.2351485 0.2623762 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       0.85485581 -0.8956178 -0.08120770 -0.8436157  0.4413548 -0.8603410
## med_low  -0.09212737 -0.2763238  0.04263895 -0.5633101 -0.1877614 -0.3785908
## med_high -0.38704309  0.1791469  0.26643202  0.3710782  0.1054813  0.4474274
## high     -0.48724019  1.0170298 -0.04947434  1.0820880 -0.4230980  0.8279971
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8290333 -0.6897369 -0.7642825 -0.43374243  0.3887547 -0.77819536
## med_low   0.3902633 -0.5443053 -0.4238660 -0.08385135  0.3219258 -0.11887074
## med_high -0.3933098 -0.4088466 -0.3025121 -0.20555126  0.1295279  0.06516648
## high     -0.8564037  1.6390172  1.5146914  0.78181164 -0.8395683  0.90684062
##                 medv
## low       0.52171440
## med_low  -0.04934231
## med_high  0.14052867
## high     -0.73164690
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.12991855  0.717286290 -1.11409479
## indus    0.01318562 -0.275547338  0.12479393
## chas    -0.08510193 -0.049959856  0.17108503
## nox      0.50033989 -0.751262885 -1.27531579
## rm      -0.10015731 -0.108297341 -0.19190232
## age      0.21057791 -0.358696692 -0.14782657
## dis     -0.05526417 -0.347261139  0.38293342
## rad      3.16593191  1.032967549 -0.36643042
## tax     -0.01168751 -0.096843332  1.10072573
## ptratio  0.12524469  0.004459516 -0.41009220
## black   -0.12747580 -0.014860435  0.07573253
## lstat    0.22137355 -0.316638242  0.31488347
## medv     0.17976929 -0.444140572 -0.22238434
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9509 0.0340 0.0151

Biplot

# Draw the LDA coefficient arrows onto an existing plot (note: this masks graphics::arrows)
arrows <- function(x, scale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = scale * heads[,choices[1]], 
         y1 = scale * heads[,choices[2]], col=color, length = arrow_heads)
  text(scale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col=classes, pch=classes)
arrows(lda.fit, scale = 1.5, color = "#ee8855")
fig. 5.1 LDA biplot

We see that, of the first two linear discriminants, LD1 nearly perfectly separates the data into two clusters: areas with a high crime rate, and the rest. rad clearly has the highest coefficient in LD1, which can be seen both from the biplot and the LDA summary.

LD2 seems to find another axis within the data that explains a smaller effect. The largest coefficients in LD2 belong to nox, medv, and zn.
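The “Proportion of trace” line in the summary above can be recomputed from the fit’s singular values (a sketch; lda() from MASS stores them in the svd component):

```r
# Sketch: each discriminant's share of the between-group variance.
prop <- lda.fit$svd^2 / sum(lda.fit$svd^2)
round(prop, 4)  # should match the Proportion of trace line
```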

6. Validation in test set

# Save the correct classes and drop the target variable from the test set
facit <- test$crime
test <- dplyr::select(test, -crime)

Let’s predict the crime rate quartiles in the test set and cross tabulate:

# Predict classes in the test data
lda.pred <- predict(lda.fit, newdata = test)

# Do a confusion matrix
tab <- table(correct = facit, predicted = lda.pred$class)
tab
##           predicted
## correct    low med_low med_high high
##   low       15       9        0    0
##   med_low    5      13        8    0
##   med_high   1      10       19    1
##   high       0       0        0   21
nrow(test)
## [1] 102

and here’s the same table visualized as an image:

image(tab)
fig. 6.1 Confusion matrix of the LDA fit

The confusion matrix shows that the model has found an axis that aligns very well with the crime rate quartile category. Most predictions are correct (68/102), the second most common case is being off by one class (33/102) and then off by two classes (1/102).

68/102 ~= 67% is not perfect, but it is a lot better than choosing at random, which should give us a correct prediction 25% of the time. It looks like the model can be used as a decent predictor of whether an area has a high or non-high crime rate (the lower left of the matrix vs. the top right); for that binary predictor, we would have a correct classification rate of 101/102, nearly 100%!
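These rates can be computed directly from the confusion table tab above (a sketch; classes are in the order low, med_low, med_high, high):

```r
# Sketch: overall accuracy, and accuracy of the binary "high vs. non-high" question.
overall <- sum(diag(tab)) / sum(tab)                    # 68/102
binary  <- (sum(tab[1:3, 1:3]) + tab[4, 4]) / sum(tab)  # 101/102
c(overall = overall, binary = binary)
```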

6. Extra: let’s try with only rad

lda.radfit <- lda(crime ~ rad, data = train)
lda.radpred <- predict(lda.radfit, newdata = test)
radtab <- table(correct = facit, predicted = lda.radpred$class)
image(radtab)
fig. 6.2 Confusion matrix with a single variable LDA fit

Using only rad gives us a model that seems to have exactly the same predictive power in the high vs not high case, but loses information in the lower quartiles. This fits with our earlier analysis of how LD1 was mostly rad and was able to carve out most of the high crime rate areas from the rest.

7. K-means

The analysis so far suggests there are at least two clear clusters in the data, so we could just choose k = 2, but let’s check with the total within cluster sum of squares what a good choice for k would be.

I will take five samples and average them, and plot the standard deviation of the twcss for each k as error bars. This should give us a more reliable plot than just taking one sample.

# determine the number of clusters by computing twcss for k = 1..10

# calculate the total within sum of squares, take 5 samples to stabilize the variance 
twcss1 <- sapply(1:10, function(k){set.seed(100); kmeans(Boston, k)$tot.withinss})
twcss2 <- sapply(1:10, function(k){set.seed(123); kmeans(Boston, k)$tot.withinss})
twcss3 <- sapply(1:10, function(k){set.seed(321); kmeans(Boston, k)$tot.withinss})
twcss4 <- sapply(1:10, function(k){set.seed(130); kmeans(Boston, k)$tot.withinss})
twcss5 <- sapply(1:10, function(k){set.seed(949); kmeans(Boston, k)$tot.withinss})

df <- as.data.frame(tibble(twcss1,twcss2,twcss3,twcss4,twcss5, k= seq(1,10)))

df <- df %>% rowwise() %>% mutate(twcss = mean(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df <- df %>% rowwise() %>% mutate(twcss_var = var(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df %>% ggplot(aes(x=k, y=twcss)) +
  geom_line() +
  geom_errorbar(aes(ymin=twcss-sqrt(twcss_var),ymax=twcss+sqrt(twcss_var),color="red"))+
  theme(legend.position="none") +
  scale_x_continuous(breaks=df$k) +
  scale_y_continuous(breaks=seq(1000000,10000000,2000000))
fig. 7.1 k-means twcss plot
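As an aside, the five near-identical sapply calls could be collapsed into one loop over the seeds; a sketch that should produce the same means and standard deviations:

```r
# Sketch: one twcss column per seed, rows indexed by k.
seeds <- c(100, 123, 321, 130, 949)
twcss_mat <- sapply(seeds, function(s)
  sapply(1:10, function(k) { set.seed(s); kmeans(Boston, k)$tot.withinss }))
twcss_mean <- rowMeans(twcss_mat)
twcss_sd   <- apply(twcss_mat, 1, sd)
```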

It does look like the plot agrees that k = 2 gives a very good fit. The k-means algorithm seems to always find the same clusters here: the error bars have essentially zero width, indicating that the twcss is constant across the five samples.

k=3 is also a potential choice, although less clear to me. After that, however, the variance increases greatly and the decrease in twcss gives diminishing returns, which indicates that there isn’t a clear structure in the data to guide the clustering.

I will go with k=2, as that seems to match what we saw in the data earlier, and the clusters are very stable.

Boston_kmeans <- Boston %>% scale %>% as.data.frame
set.seed(101)
# k-means clustering
k2m <- kmeans(Boston_kmeans, centers = 2)
summary(k2m)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       28    -none- numeric
## totss          1    -none- numeric
## withinss       2    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           2    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
k2m$centers
##         crim         zn      indus         chas        nox         rm
## 1 -0.3894158  0.2621323 -0.6146857  0.002908943 -0.5823397  0.2446705
## 2  0.7238295 -0.4872402  1.1425514 -0.005407018  1.0824279 -0.4547830
##          age        dis        rad        tax    ptratio      black      lstat
## 1 -0.4331555  0.4540421 -0.5828749 -0.6291043 -0.2943707  0.3282754 -0.4530491
## 2  0.8051309 -0.8439539  1.0834228  1.1693521  0.5471636 -0.6101842  0.8421083
##         medv
## 1  0.3532917
## 2 -0.6566834
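As a sketch supporting this (assuming the row order of Boston_scaled still matches Boston_kmeans, which it should since neither was reordered), we can cross-tabulate the cluster assignment against the crime-rate quartiles from section 4:

```r
# Sketch: how the two k-means clusters line up with the crime-rate quartiles.
table(cluster = k2m$cluster, crime = Boston_scaled$crime)
```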

It seems that even this k = 2 clustering has found the high-crime-rate cluster. Let’s confirm that with a visualization.

ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k2m$cluster), fill=factor(k2m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero

(The same warning is printed nine times: within one of the clusters some variable, likely zn, is constant, so the within-group correlation is undefined.)
fig. 7.2 k-means pairs (k = 2)

It seems that the model has picked two clusters that have the following relative position to each other:

  • the red cluster has a much lower crime rate
  • the red cluster has a lower radial highway access index
  • the red cluster has a lower tax rate
  • the red cluster has a much higher proportion of black residents
  • the red cluster has a lower proportion of buildings built prior to 1940
  • the red cluster has a similar median value distribution, shifted towards higher valuations (except for a blue bump right at the top of the value range)
  • the blue cluster has a much smaller distance to employment centers
  • the blue cluster has a much smaller proportion of residential land zoned for large plots
  • the red cluster has a higher average number of rooms per dwelling, i.e. fewer single room apartments and other small housing units
  • the red cluster has a smaller proportion of working class people and non-educated adults

So it seems we have found a blue cluster with a lot of business activity (high tax rate, close to employment centers), with access to arterial highways and high-density building (fewer rooms, zoned for smaller plots), and a red cluster of areas with less business activity, in relatively quiet regions with longer commutes, more working class or less-educated people, and a much higher proportion of black residents.

So it seems like the red regions are newer developments, suburbs farther away from the city, and there may be some price discrimination in the house prices (medv) connected with the high proportion of black residents living there. You can see the effects of segregationist US housing policy in the data. E.g., the 1949 Housing Act set up a framework to subsidize public housing for whites, with clauses forbidding resale to black people, which meant that black people paid more for housing (see e.g. Abramovitz & Smith, “The Persistence of Residential Segregation by Race, 1940 to 2010: The Role of Federal Housing Policy”, Families in Society, Vol. 102, Issue 1, for more).

7.2 Let’s try with k=3

Why not try with k=3 for good measure? Maybe we can find additional structure in the data.

set.seed(124293)
k3m <- kmeans(Boston_kmeans, centers = 3)
summary(k3m)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       42    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
k3m$centers
##         crim         zn      indus        chas        nox         rm        age
## 1 -0.4076669  1.1526549 -0.9877755 -0.10115080 -0.9634859  0.7739125 -1.1241828
## 2 -0.3688324 -0.3935457 -0.1369208  0.07398993 -0.1662087 -0.1700456  0.1673019
## 3  0.8942488 -0.4872402  1.0913679 -0.01330932  1.1109351 -0.4609873  0.7828949
##           dis        rad        tax     ptratio      black       lstat
## 1  1.05650031 -0.5965522 -0.6837494 -0.62478941  0.3607235 -0.90904433
## 2 -0.07766431 -0.5799077 -0.5409630 -0.04596655  0.2680397 -0.05818052
## 3 -0.84882600  1.3656860  1.3895093  0.63256391 -0.7083974  0.90799414
##          medv
## 1  0.84137443
## 2 -0.04811607
## 3 -0.69550394
ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k3m$cluster), fill=factor(k3m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero

(Again printed nine times: a variable that is constant within a cluster makes the within-group correlation undefined.)
fig. 7.3 k-means pairs (k = 3)

Here we see that the blue cluster in this plot is roughly the same as the blue cluster from the k=2 clustering. The k=2 red cluster has here been split into red and green.
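This can be checked directly by cross-tabulating the two assignments (a sketch; note the cluster numbers themselves are arbitrary labels, so it is the pattern of overlap that matters):

```r
# Sketch: overlap between the k = 2 and k = 3 cluster assignments.
table(k2 = k2m$cluster, k3 = k3m$cluster)
```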

The differences between the red and green clusters seem to be:

  • the red cluster has higher values of zn (more land zoned for large plots)
  • the red cluster has a smaller proportion of older buildings
  • the red cluster consists exclusively of black neighbourhoods, while the green cluster has some spread
  • the green cluster is between the red and blue clusters in the proportion of laborers and less-educated adults

Maybe the red cluster matches better with those black neighbourhoods built more recently, which the 1949 Housing Act and the Federal Housing Administration regulations applied to? I don’t know for sure; more analysis would be required.